Abstract: Sensing and aggregation of noisy observations should not be
considered as separate issues. The quality of collective estimation
involves a difficult tradeoff between sensing quality, which increases
with the number of sensors, and aggregation quality, which typically
decreases if the number of sensors is too large. We examine a strategy
for optimal aggregation for an ensemble of independent sensors with
constrained system capacity. We show that in the large capacity limit
larger scale aggregation always outperforms smaller scale aggregation
at higher noise levels, while below a critical value of noise, there
exist moderate scale aggregation levels at which optimal estimation is
realized.

I. Introduction 

This letter presents results which give a new perspective on the
growing field of sensory data aggregation by clarifying fundamental
principles of large-scale aggregation. Examples of large-scale
aggregation of observations include astronomical observations [1],
biological sensing [2], early detection of natural disasters such as
earthquakes, tidal waves and floods [3], and wireless sensor networks
[4]. Errors in observations can be reduced by collecting observation
data from more sensors. However, collecting data from many sensors
usually involves some cost in terms of system resources, resulting in
fundamental tradeoffs [5]. The theoretical understanding of these
tradeoffs in natural and engineered systems is now a high priority.

An important fundamental problem in this field is the 
problem of aggregating independent observations of the same 
phenomenon with a resource constraint. Previous works have 
analyzed the tradeoff behavior between aggregate data rate 
and sensing error from the fundamental view of information 
theory. The analysis has been extended to include the situation 
where arbitrarily large numbers of samples can be collected 
by reducing the data aggregated from each sample using lossy 
data compression. However, so far results have only been 
obtained for the fundamental information theoretic bounds 
with infinitely many sensors [6], [7], or specific situations in
which the number of sensors is fixed [8]. The previous works
do not include the situation where the number of observations 
can be varied, and thus the results are not sufficient to support 
our understanding and design of real world systems. 

In this paper we introduce a modification of the common basic model
for data aggregation with compression which makes it more tractable
and amenable to analysis when the number of sensors can vary.
Specifically, we consider independent decompression of each
observation in a discrete version of the CEO problem [6]. We show that
this model reveals a new property, the existence of a noise threshold
beyond which large-scale aggregation is superior to lossless
aggregation with no compression. This can be seen as a manifestation
of "more is different" in sensor networks [9]. Moreover, we show that
universal results for the scaling behavior of the collective
estimation error can be obtained by considering the asymptotic
behavior when the system capacity diverges to infinity.

In this paper, we consider a fundamental formulation of the 
problem with only one information source and suppose that 
all sensors are symmetrical, i.e., exchangeable with respect 
to their contributions to the final result of aggregation. This 
allows us to treat the problem in terms of the theory of large 
deviations. The paper is divided into five sections. Section II
presents our system model. Section III briefly summarizes
our main result. The proof of the proposition, however, is
postponed until Section IV. Discussions are
given in Section V.

II. System Model 

Now we start by introducing our system model for large- 
scale aggregation of independent noisy observations. Notice 
that we explicitly consider a capacity constraint. This section 
briefly summarizes the optimal strategy for the case of redun- 
dantly observing a Bernoulli (1/2) sequence with very many 
sensors. 

A. Ensemble of Independent Sensors 

We consider an observer who is interested in observing a purely random
source $X$, the state of which can be represented by a series of Ising
variables $X_\mu$, whose realizations are explicitly denoted by the
lower case letters $x_\mu = \pm 1$. We assume that this observer cannot
directly observe the source. Instead, he deploys a collection of $L$
sensors, labeled by an index $a$, to independently observe the source
and report the results of their observations over a communication
network. Assuming a certain level of environmental noise, the
individual observations $Y_\mu(a)$ could be different for each sensor.
We define a common level of noise $p \in [0, 1/2)$ for our
observations by

$$p = \langle \delta(X_\mu, -Y_\mu(a)) \rangle$$

with Kronecker's delta $\delta$, where the bracket
$\langle \cdot \rangle$ denotes the expectation of its argument.

Then we suppose that each sensor can compress (i.e., lossy encode), if
necessary, its sensor readings $y(a) = (y_1(a), \ldots, y_M(a))$ into a
codeword $z(a) = (z_1(a), \ldots, z_N(a))$ independently. In this
paper, we assume that the codeword is represented by a series of Ising
variables $Z_\nu(a)$ and thus their realizations are restricted to
$z_\nu(a) = \pm 1$ as well. We further assume that the sensors
themselves cannot share any information about their observations. That
is, they are not permitted to communicate with each other to decide
what to send beforehand. As a result, the observer must collect the $L$
codewords from all the sensors, each of which separately encodes its
own observations $y_\mu(a)$, and use them to estimate the original
$x_\mu$ for $\mu = 1, \ldots, M$. We assume here that the lengths of
the codewords are the same, $N$, so that all the sensors are identical
with respect to their ability to encode their observations. That is,
regardless of the sensor label, the rate for the lossy encoding is
given by $R = N/M$. Therefore, the load level of our network can be
measured by the sum rate $LR$, which should not be greater than the
network capacity, say $C$. We assume that $C$ is a given integer, not a
real, in which case our argument will be greatly simplified.

If the sensors were able to share information about their observations
before reporting to the observer, then they would be able to smooth
out their independent environmental noises entirely as the number of
sensors $L$ diverges. Then the observer could figure out all the
realizations of $X_\mu$ if the network capacity $C$ exceeded 1, which
is the entropy rate of the source $X$. However, if mutual
communication is prohibited, there does not exist any finite value of
$C$ for which even infinitely many sensors can transmit all the
information [6]. Therefore, our goal should be the semifaithful
reconstruction of the original $X_\mu$ given the codewords $Z_\nu(a)$
under a certain fidelity criterion.

B. Exchangeable Sensor Ansatz 

Let $\hat{y}(a) = (\hat{y}_1(a), \ldots, \hat{y}_M(a))$ be the best
reproductions of the observations $y(a)$ obtained from the codewords
$z(a)$. Assume that the distortion between two sequences is always
measured by the Hamming distance per symbol. Then it is easy to see
that the distortion is given by, in this case,

$$d(y(a), \hat{y}(a)) = \frac{1}{M} \sum_{\mu=1}^{M} \delta(y_\mu(a), -\hat{y}_\mu(a))$$

for $a = 1, \ldots, L$. Since we have exchangeable sensors as stated,
we can impose that

$$\langle d(Y(a), \hat{Y}(a)) \rangle \le D$$

for any given pair. With this Hamming distortion constraint, the lower
bound on the rate $R(D)$ required to describe a variable $Y_\mu(a)$ is
given by

$$R(D) = 1 - H_2(D),$$

where $H_2(D)$ denotes the binary entropy function [10]. This is called
the rate distortion function for the Bernoulli(1/2) source.

The observer then collects all the transmitted information $Z_\nu(a)$
to calculate the estimate $\hat{x}_\mu$ for the $\mu$th symbol of the
unknown $x = (x_1, \ldots, x_M)$. To go further, we now restrict
ourselves to the case of

$$\langle \delta(Y_\mu(a), -\hat{Y}_\mu(a)) \rangle = D.$$

That is, every variable $\hat{Y}_\mu(a)$ in the reproductions is
expected to have the same error probability $D$. Notice also that the
three variables $X_\mu$, $Y_\mu(a)$, and $\hat{Y}_\mu(a)$ form a Markov
chain, where the best estimator for $X_\mu$ is $\hat{Y}_\mu(a)$ if
$0 \le p, D < 1/2$ holds. Then it is straightforward to get,
independently,

$$\langle \delta(X_\mu, -\hat{Y}_\mu(a)) \rangle = \rho,$$

where

$$\rho = p(1 - D) + (1 - p)D$$

represents the combined error probability for replacing the original
$x_\mu$ by the available symbols $\hat{y}_\mu(a)$. In other words, the
error indicator function $\delta(X_\mu, -\hat{Y}_\mu(a))$ reduces to
the Bernoulli random variable that takes the value 1 with probability
$\rho$ for $a = 1, \ldots, L$.
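As a rough numerical companion to the definitions in this subsection, the following Python sketch evaluates the binary entropy $H_2(D)$, the rate distortion function $R(D) = 1 - H_2(D)$, its inverse $D(R)$, and the combined error probability $\rho$. The function names and the bisection tolerance are illustrative assumptions and do not come from the original model; the inverse is obtained by plain bisection, since no closed form is given in the text.

```python
import math


def binary_entropy(d):
    """Binary entropy H2(d) in bits, with H2(0) = H2(1) = 0 by convention."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log2(d) - (1.0 - d) * math.log2(1.0 - d)


def rate(d):
    """Rate distortion function R(D) = 1 - H2(D) for 0 <= D <= 1/2."""
    return 1.0 - binary_entropy(d)


def distortion(r, tol=1e-12):
    """Distortion rate function D(R): invert R(D) on [0, 1/2] by bisection.

    R(D) is decreasing on [0, 1/2], with R(0) = 1 and R(1/2) = 0.
    The tolerance is an assumption chosen for this sketch.
    """
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rate(mid) > r:
            lo = mid   # rate still too high, so the distortion must grow
        else:
            hi = mid
    return 0.5 * (lo + hi)


def combined_error(p, d):
    """Combined error probability rho = p(1 - D) + (1 - p)D."""
    return p * (1.0 - d) + (1.0 - p) * d
```

For instance, `combined_error(0.1, distortion(0.5))` gives the per-sensor error probability when sensors at noise level 0.1 each encode at rate 1/2.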

C. Bayes Optimal Estimator 

Now let us consider the most probable realization of $X_\mu$ given a
set of evidence $\hat{y}_\mu = (\hat{y}_\mu(1), \ldots, \hat{y}_\mu(L))$.
Since $\delta(X_\mu, -\hat{Y}_\mu(a))$ obeys Bernoulli statistics, it is
easy to see that the majority vote procedure gives the best strategy
[11]. That is, the optimal estimator should be the mapping

$$\hat{X}_\mu = \operatorname{sgn}\left( \sum_{a=1}^{L} \hat{Y}_\mu(a) \right).$$
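The majority vote can be tried out with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: each reproduction is modeled as the source symbol flipped independently with probability `rho` (the exchangeable sensor ansatz), ties for even $L$ are broken by a fair coin, and the function name, arguments and seed are hypothetical.

```python
import random


def simulate_majority_vote(num_sensors, rho, num_symbols, seed=0):
    """Estimate the majority-vote error rate by simulation.

    Assumption: every reproduction equals the source symbol flipped
    independently with probability rho.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(num_symbols):
        x = rng.choice((-1, 1))                      # source symbol
        total = 0
        for _ in range(num_sensors):
            y_hat = -x if rng.random() < rho else x  # flipped with prob. rho
            total += y_hat
        if total == 0:
            x_hat = rng.choice((-1, 1))              # fair coin on a tie
        else:
            x_hat = 1 if total > 0 else -1           # sign of the vote sum
        errors += (x_hat != x)
    return errors / num_symbols
```

With five sensors and `rho = 0.2`, the estimate should fluctuate around the exact binomial tail value 0.05792, up to sampling error.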

Then the overall error probability for the estimate $\hat{X}_\mu$ is
minimized. The probability of getting more than $L/2$ errors out of $L$
Bernoulli trials is given by

$$P(\hat{X}_\mu \ne X_\mu) = \langle \delta(X_\mu, -\hat{X}_\mu) \rangle = \begin{cases} \sum_{l=(L+1)/2}^{L} Q_\rho(l|L) & (L \text{ is odd}) \\ \sum_{l=L/2+1}^{L} Q_\rho(l|L) + \frac{1}{2} Q_\rho(L/2|L) & (L \text{ is even}) \end{cases}$$

where

$$Q_\rho(l|L) = \binom{L}{l} \rho^l (1 - \rho)^{L-l}$$

denotes the binomial distribution. In principle, we may choose any
value of $L$ which is compatible with the sum rate constraint
$LR \le C$. To minimize the error probability for the estimator
$\hat{X}_\mu$, however, we should use the largest possible value.
Hereafter we assume that $L$ denotes the largest possible value. In
particular, suppose that the sensors do not encode their observations.
Instead, each sensor simply sends the whole information of the noisy
$Y_\mu(a)$. Then, the error probability for the estimator reduces to

$$P(\hat{X}_\mu \ne X_\mu) = \begin{cases} \sum_{l=(C+1)/2}^{C} Q_p(l|C) & (C \text{ is odd}) \\ \sum_{l=C/2+1}^{C} Q_p(l|C) + \frac{1}{2} Q_p(C/2|C) & (C \text{ is even}) \end{cases}$$

Fig. 1. Optimal aggregation levels for an ensemble of independent
sensors in a noisy environment. Here $p$ is the given noise level. The
solid line denotes the optimal data rate $R^*$ per sensor, which
maximizes the exponential decay rate $I_p(R)$ of the vanishing error
probability as the system capacity $C$ increases. For comparison, the
dashed line represents the most pessimistic value, which minimizes
$I_p(R)$.
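The error probabilities of this subsection can be evaluated directly for small systems. The following Python sketch is a minimal illustration, assuming the usual binomial form $\binom{L}{l}\rho^l(1-\rho)^{L-l}$ for the distribution of the number of errors; the function names are hypothetical.

```python
import math


def binomial_pmf(l, L, rho):
    """Probability of exactly l errors among L independent sensors."""
    return math.comb(L, l) * rho**l * (1.0 - rho)**(L - l)


def majority_error(L, rho):
    """Exact error probability of the majority vote over L sensors.

    For odd L, an error needs at least (L + 1) / 2 wrong reproductions.
    For even L, a tie at L / 2 counts with weight 1/2.
    """
    if L % 2 == 1:
        return sum(binomial_pmf(l, L, rho) for l in range((L + 1) // 2, L + 1))
    tail = sum(binomial_pmf(l, L, rho) for l in range(L // 2 + 1, L + 1))
    return tail + 0.5 * binomial_pmf(L // 2, L, rho)
```

The uncompressed case corresponds to `majority_error(C, p)`, since then every sensor reports at rate 1 and the combined error probability equals the noise level.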



III. Statement of Results 

An exact formula on the optimal data rate for individual 
sensors is presented in this section. By using the notion of 
large deviations an optimality measure for the data aggregation 
tasks is introduced. Numerical analysis of our exact result 
provides insights on the nature of large-scale aggregation in 
sensing systems, natural or engineered. 

A. Optimality Measure 

Assume that a network capacity $C$ is given. Consider that the common
data rate $R$ is first allocated to all the sensors. The number of
sensors $L$ is thus determined as the maximum value of $L$ satisfying
the sum rate constraint $RL \le C$. In our system model, it is obvious
that $P(\hat{X}_\mu \ne X_\mu) \to 0$ as $C \to \infty$. As is shown in
Section IV, it is not hard to refine the above statement of
convergence and to prove that $P(\hat{X}_\mu \ne X_\mu)$ decays to 0
exponentially fast as $C \to \infty$. By analogy with large deviation
theory [12], we define the exponential rate of decay by

$$I_p(R) = -\lim_{C \to \infty} \frac{1}{C} \ln P(\hat{X}_\mu \ne X_\mu) \qquad (0 < R \le 1).$$

The decay rate $I_p(R)$ describes the limiting behavior of the system
at a macroscopic level, on which the rate $R$ could be used as a
control parameter [13]. The case of $R = 1$ reduces to a naive
aggregation scheme in which the sensors just send their noisy
observations to the observer. For this smallest aggregation, we
aggregate data from only $L = C$ sensors. Hereafter, we call this
scheme the level-1 aggregation. For a given $R > 0$, the level-$R$
aggregation is defined as the scheme in which every sensor encodes its
observations at the rate $R$ independently. As an extension of the
definition of $I_p(R)$ for $R > 0$, we could naturally define the
level-0 decay rate as

$$I_p(0) = -\lim_{C \to \infty} \frac{1}{C} \lim_{R \to 0} \ln P(\hat{X}_\mu \ne X_\mu).$$

B. Large Deviations Result 

Assume that $D(R)$ denotes the distortion rate function, which is the
inverse function of $R(D)$. Suppose that

$$\rho_p(R) = p(1 - D(R)) + (1 - p)D(R).$$

Then, for $0 < R \le 1$, the main result of this paper is given below.

Proposition 1: We have

$$I_p(R) = -\frac{1}{R} \left[ \ln 2 + \frac{\ln \rho_p(R)}{2} + \frac{\ln(1 - \rho_p(R))}{2} \right]. \qquad (1)$$

The maximum of $I_p(R)$ is of great interest from an engineering point
of view. That is, we prefer larger values of $I_p(R)$. Therefore, we
examine the optimal levels defined by
$R^* = \arg\max_{0 \le R \le 1} I_p(R)$. The optimal aggregation, for a
given $p$, is called the level-$R^*$ aggregation.
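A coarse search for the optimal level can be sketched in a few lines of Python. This is a hedged illustration rather than the procedure used for the figures: the search runs over a uniform grid in the distortion $D$ (so that $R = 1 - H_2(D)$ needs no inversion), the grid size is an arbitrary assumption, and the level-0 candidate uses the value $(1 - 2p)^2 \ln 2$, which is what formula (1) tends to as $R \to 0$ by a Taylor expansion around $D = 1/2$. All function names are hypothetical.

```python
import math


def binary_entropy(d):
    """Binary entropy H2(d) in bits."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log2(d) - (1.0 - d) * math.log2(1.0 - d)


def decay_rate(p, d):
    """Formula (1) at R = 1 - H2(d), for 0 < p < 1/2 and 0 <= d < 1/2."""
    r = 1.0 - binary_entropy(d)
    rho = p * (1.0 - d) + (1.0 - p) * d
    bracket = math.log(2.0) + 0.5 * math.log(rho) + 0.5 * math.log(1.0 - rho)
    return -bracket / r


def level0_decay_rate(p):
    """Limit of formula (1) as R -> 0, namely (1 - 2p)^2 ln 2."""
    return (1.0 - 2.0 * p) ** 2 * math.log(2.0)


def optimal_rate(p, grid=1000):
    """Grid search for (R*, I_p(R*)); R = 0 stands for the level-0 limit."""
    best_r, best_i = 0.0, level0_decay_rate(p)
    for k in range(grid):
        d = 0.5 * k / grid                 # distortion in [0, 1/2)
        value = decay_rate(p, d)
        if value > best_i:
            best_r, best_i = 1.0 - binary_entropy(d), value
    return best_r, best_i
```

On this grid, `optimal_rate(0.3)` selects level-0, `optimal_rate(0.01)` selects level-1, and `optimal_rate(0.1)` returns an intermediate rate. The exact location of the returned optimum depends on the grid resolution.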

C. Numerical Findings 

We now examine the behavior of formula (1), which gives the optimal
levels $R^*$ for the noise $p$. As is seen in Fig. 1, the optimal
aggregation scale diverges, i.e., the optimal data rate $R^*$ per
sensor vanishes, for noise levels larger than the critical point
$p_0 = 0.211$. In this noisy region, we want the system to be as large
as possible. The larger the system we have, the smaller the error
probability. By definition, the optimal aggregation is said to be
level-0. In contrast, we can always find non-zero optimal levels below
$p_0$. In particular, if the noise level is below $p_1 = 0.024$, our
investigations indicate that the level-1 aggregation is optimal.
Moderate aggregation levels could be optimal at the intermediate noise
levels between the two critical points. It is also worth noticing that
the behavior of $R^*$ as a function of $p$ is reminiscent of that of
order parameters at a continuous phase transition in statistical
mechanics [14]. The analytical results presented here are also
consistent with numerical simulations for the system size $C = 50$, as
shown in Fig. 3.

Since the optimal levels $R^*$ are unique values for each noise $p$, we
can plot the optimal decay rate $I_p(R^*)$ as given in Fig. 2. The
optimal rate $I_p(R^*)$ describes the limiting behavior of the smallest
error probability $P(\hat{X}_\mu \ne X_\mu)$ in terms of macroscopic
variables. Clearly, it is a strongly decreasing function of the noise
$p$.




Fig. 2. Maximum and minimum decay rates for vanishing error probability
of final decision. The solid line denotes the largest decay rate
$I_p(R^*)$ at the noise level $p$, which is given by the optimal data
rate $R^*$ per sensor. For comparison, the dashed line represents the
smallest decay rate, which is given by the most pessimistic value of
the data rate.



IV. Analysis 

This section presents the large deviations analysis which gives
Proposition 1 and describes briefly how it relates to the previous
work using the Gaussian approximation [15]. Numerical experiments
support our recent result.

A. Gaussian approximation 

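For sufficiently large $L$, the binomial distribution $Q_\rho(l|L)$ is
well approximated by the Gaussian distribution
$N(L\rho, L\rho(1 - \rho))$ with mean $L\rho$ and variance
$L\rho(1 - \rho)$ [15]. Changing the variable

$$s = \frac{l - L\rho}{\sqrt{L\rho(1 - \rho)}}$$

enables us to use a naive approximation to get

$$P(\hat{X}_\mu \ne X_\mu) \approx \int_{\lambda}^{\infty} Ds, \qquad (2)$$

where we denote, respectively,

$$Ds = \frac{ds}{\sqrt{2\pi}} e^{-s^2/2}, \qquad \lambda = \frac{(1/2 - \rho)\sqrt{L}}{\sqrt{\rho(1 - \rho)}}.$$

Since every sensor can achieve the optimal rate $R(D)$, we may
evaluate the number of sensors $L$ as $C/R(D)$.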

Assume that $D(R)$ denotes the distortion rate function, which is the
inverse function of $R(D)$. Suppose that
$a(p, R) = (1 - 2p)(1 - 2D(R))$ for $0 \le p < 1/2$ and $0 < R \le 1$.
Together with the identity

$$\frac{1}{2} - \rho = (1 - 2p)\left(\frac{1}{2} - D\right),$$

we have estimated the rate function as

$$I_p(R) = \begin{cases} (1 - 2p)^2 \ln 2 & (R = 0) \\ \dfrac{a(p, R)^2}{2R\,(1 - a(p, R))(1 + a(p, R))} & (0 < R \le 1) \end{cases}$$

However, numerical evidence does not support the above formula, i.e.,
the Gaussian approximation (2). This motivates us to apply the
standard large deviation analysis, as shown below.
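To get a feel for the size of the discrepancy, one can compare the Gaussian estimate with the exponent appearing in Proposition 1. The Python sketch below works with per-sensor exponents as functions of the combined error probability $\rho$ (dividing either one by $R$ gives the corresponding rate per unit capacity). The function names are hypothetical, and the Gaussian exponent is taken to be $(1/2 - \rho)^2 / (2\rho(1 - \rho))$, which is algebraically the same as $a^2 / (2(1 - a)(1 + a))$ with $a = 1 - 2\rho$.

```python
import math


def large_deviation_exponent(rho):
    """Per-sensor exponent in Proposition 1: -[ln 2 + (ln rho + ln(1 - rho)) / 2]."""
    return -(math.log(2.0) + 0.5 * math.log(rho) + 0.5 * math.log(1.0 - rho))


def gaussian_exponent(rho):
    """Per-sensor exponent implied by the Gaussian tail estimate."""
    return (0.5 - rho) ** 2 / (2.0 * rho * (1.0 - rho))


def compare(rhos):
    """Return (rho, large deviation exponent, Gaussian exponent) triples."""
    return [(rho, large_deviation_exponent(rho), gaussian_exponent(rho))
            for rho in rhos]
```

Because $-\ln(1 - x) \le x / (1 - x)$ for $0 \le x < 1$, the Gaussian exponent is never smaller than the large deviation exponent. The two agree to leading order when $\rho$ is close to 1/2 and separate as $\rho$ decreases, for example roughly 0.889 against 0.511 at $\rho = 0.1$.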

B. Large Deviation Analysis 

Write the error indicator function $\delta(X_\mu, -\hat{Y}_\mu(a))$ as
$Z_\mu(a)$. For a given $p$, this is a Bernoulli random variable that
takes the value 1 with probability

$$\rho = p(1 - D) + (1 - p)D$$

for $a = 1, \ldots, L$. Consider the sample average defined to be

$$M_\mu = \frac{1}{L} \sum_{a=1}^{L} Z_\mu(a).$$

Since the expectation $\langle Z_\mu(a) \rangle = \rho$ is finite, we
know that $M_\mu$ approaches $\rho$ by the law of large numbers.
However, the value of interest is the error probability
$P(\hat{X}_\mu \ne X_\mu)$ for the majority vote procedure, which is
identical to $P(M_\mu > 1/2)$ up to the tie-breaking term for even $L$.
For $0 < \rho < 1/2$ and thus $|\rho - 1/2| > 0$, the vanishing
$P(M_\mu \ge 1/2)$ is called a large deviation probability.

Consider the rate function of $Z_\mu(a)$. Since
$Z_\mu(1), Z_\mu(2), \ldots, Z_\mu(L)$ are $L$ independent
Bernoulli($\rho$) random variables, the Legendre transform gives the
rate function $I_\rho(z)$ for the sample average as

$$I_\rho(z) = z \ln \frac{z}{\rho} + (1 - z) \ln \frac{1 - z}{1 - \rho}$$

for $0 \le z \le 1$ and $\infty$ otherwise [12]. Since the number of
sensors $L$ is given by $C/R$, changing the variable

$$C = LR$$

yields

$$I_{p,R}(z) = \frac{1}{R} I_{\rho_p(R)}(z).$$

Since the set $A_\mu = \{M_\mu \ge 1/2\}$ is closed and does not
contain $\rho$, the large deviation property tells us that

$$\lim_{C \to \infty} \frac{1}{C} \ln P(A_\mu) = -\min_{z \ge 1/2} I_{p,R}(z).$$

Then it is an easy matter to check that

$$\min_{z \ge 1/2} I_{p,R}(z) = I_{p,R}(1/2) = -\frac{1}{R} \left[ \ln 2 + \frac{\ln \rho_p(R)}{2} + \frac{\ln(1 - \rho_p(R))}{2} \right].$$

Write $I_p(R) = I_{p,R}(1/2)$ for convenience. For a given $R$ we
conclude that

$$I_p(R) = -\lim_{C \to \infty} \frac{1}{C} \ln P(A_\mu).$$

This completes the proof of Proposition 1.
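The limit in Proposition 1 can be checked numerically in the simplest setting. The sketch below is an illustration under stated assumptions: it treats only level-1 aggregation ($R = 1$, $L = C$, so the combined error probability equals $p$), evaluates the exact binomial error probability with fair tie-breaking, and compares $-(1/C)\ln P$ with the limiting value. The function names are hypothetical, and the plain floating point arithmetic restricts $C$ to roughly below 1000.

```python
import math


def majority_error(L, rho):
    """Exact majority-vote error probability with fair tie-breaking."""
    def pmf(l):
        return math.comb(L, l) * rho**l * (1.0 - rho)**(L - l)
    if L % 2 == 1:
        return sum(pmf(l) for l in range((L + 1) // 2, L + 1))
    return sum(pmf(l) for l in range(L // 2 + 1, L + 1)) + 0.5 * pmf(L // 2)


def finite_capacity_exponent(C, p):
    """-(1/C) ln P for level-1 aggregation (R = 1, L = C, rho = p)."""
    return -math.log(majority_error(C, p)) / C


def level1_decay_rate(p):
    """Formula (1) at R = 1, where the combined error probability is p."""
    return -(math.log(2.0) + 0.5 * math.log(p) + 0.5 * math.log(1.0 - p))
```

For $p = 0.1$, `finite_capacity_exponent(C, 0.1)` decreases toward `level1_decay_rate(0.1)` as $C$ grows. The gap closes slowly because of the subexponential prefactor that the limit discards.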




Fig. 3. The solid line denotes the optimal data rate R* given by the large 
deviation analysis, while the dashed line represents the same value calculated 
by the Gaussian approximation in the prior work. The circles indicate the 
numerical experiments for C = 50. 
V. Discussion

It has been shown that the optimal aggregation for an ensemble of
independent sensors exhibits a critical behavior of the data rate per
sensor $R = C/L$ with respect to the external noise level $p$. The
simple analytic model shows that in the high noise region beyond a
critical value of noise $p_0$, the data rate $R$ should converge to
zero in order to reduce collective estimation error. This means that
we should deploy very many sensors, $L \gg C$, in the large $C$ limit.
In contrast, if the noise level is lower than the critical point, the
data rate $R$ should take a positive value. In this case, the number of
sensors scales as $L = O(C)$. Numerical evidence supports our large
deviation analysis for the optimality measure.